{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Simple RAG (Retrieval-Augmented Generation) System for CSV Files\n",
        "\n",
        "## Overview\n",
        "\n",
        "This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
        "\n",
        "# CSV File Structure and Use Case\n",
        "The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.\n",
        "\n",
        "## Key Components\n",
        "\n",
        "1. Loading and spliting csv files.\n",
        "2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
        "3. Query engine setup for querying the processed documents\n",
        "4. Creating a question and answer over the csv data.\n",
        "\n",
        "## Method Details\n",
        "\n",
        "### Document Preprocessing\n",
        "\n",
        "1. The csv is loaded using LlamaIndex's [PagedCSVReader](https://docs.llamaindex.ai/en/stable/api_reference/readers/file/#llama_index.readers.file.PagedCSVReader)\n",
        "2. This reader converts each row into a LlamaIndex Document along with the respective column names of the table. No further splitting applied.\n",
        "\n",
        "\n",
        "### Vector Store Creation\n",
        "\n",
        "1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
        "2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
        "\n",
        "### Query Engine Setup\n",
        "\n",
        "1. A query engine is configured to fetch the most relevant chunks for a given query then answer the question.\n",
        "\n",
        "## Benefits of this Approach\n",
        "\n",
        "1. Scalability: Can handle large documents by processing them in chunks.\n",
        "2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
        "3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
        "4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
        "\n",
        "## Conclusion\n",
        "\n",
        "This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Package Installation and Imports\n",
        "\n",
        "The cell below installs all necessary packages required to run this notebook.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install required packages\n",
        "!pip install faiss-cpu llama-index pandas python-dotenv"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "from llama_index.core.readers import SimpleDirectoryReader\n",
        "from llama_index.core import Settings\n",
        "from llama_index.llms.openai import OpenAI\n",
        "from llama_index.embeddings.openai import OpenAIEmbedding\n",
        "from llama_index.readers.file import PagedCSVReader\n",
        "from llama_index.vector_stores.faiss import FaissVectorStore\n",
        "from llama_index.core.ingestion import IngestionPipeline\n",
        "from llama_index.core import VectorStoreIndex\n",
        "import faiss\n",
        "import os\n",
        "import pandas as pd\n",
        "from dotenv import load_dotenv\n",
        "\n",
        "\n",
        "# Load environment variables from a .env file\n",
        "load_dotenv()\n",
        "\n",
        "# Set the OpenAI API key environment variable\n",
        "os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
        "\n",
        "\n",
        "# Llamaindex global settings for llm and embeddings\n",
        "EMBED_DIMENSION=512\n",
        "Settings.llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
        "Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### CSV File Structure and Use Case\n",
        "The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Download required data files\n",
        "import os\n",
        "os.makedirs('data', exist_ok=True)\n",
        "\n",
        "# Download the PDF document used in this notebook\n",
        "!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf\n",
        "!wget -O data/customers-100.csv https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/customers-100.csv\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "file_path = ('data/customers-100.csv') # insert the path of the csv file\n",
        "data = pd.read_csv(file_path)\n",
        "\n",
        "# Preview the csv file\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Vector Store"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create FaisVectorStore to store embeddings\n",
        "fais_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
        "vector_store = FaissVectorStore(faiss_index=fais_index)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Load and Process CSV Data as Document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {},
      "outputs": [],
      "source": [
        "csv_reader = PagedCSVReader()\n",
        "\n",
        "reader = SimpleDirectoryReader( \n",
        "    input_files=[file_path],\n",
        "    file_extractor= {\".csv\": csv_reader}\n",
        "    )\n",
        "\n",
        "docs = reader.load_data()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Index: 1\n",
            "Customer Id: DD37Cf93aecA6Dc\n",
            "First Name: Sheryl\n",
            "Last Name: Baxter\n",
            "Company: Rasmussen Group\n",
            "City: East Leonard\n",
            "Country: Chile\n",
            "Phone 1: 229.077.5154\n",
            "Phone 2: 397.884.0519x718\n",
            "Email: zunigavanessa@smith.info\n",
            "Subscription Date: 2020-08-24\n",
            "Website: http://www.stephenson.com/\n"
          ]
        }
      ],
      "source": [
        "# Check a sample chunk\n",
        "print(docs[0].text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Ingestion Pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [],
      "source": [
        "pipeline = IngestionPipeline(\n",
        "    vector_store=vector_store,\n",
        "    documents=docs\n",
        ")\n",
        "\n",
        "nodes = pipeline.run()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create Query Engine"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {},
      "outputs": [],
      "source": [
        "vector_store_index = VectorStoreIndex(nodes)\n",
        "query_engine = vector_store_index.as_query_engine(similarity_top_k=2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Query the rag bot with a question based on the CSV data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'Rasmussen Group'"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "response = query_engine.query(\"which company does sheryl Baxter work for?\")\n",
        "response.response"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=all-rag-techniques--simple-csv-rag-with-llamaindex)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "",
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}